Exploring the Relationship between LLM Hallucinations and Prompt Linguistic Nuances: Readability, Formality, and Concreteness
As Large Language Models (LLMs) have advanced, they have brought forth new
challenges, with one of the prominent issues being LLM hallucination. While
various mitigation techniques are emerging to address hallucination, it is
equally crucial to delve into its underlying causes. Consequently, in this
preliminary exploratory investigation, we examine how linguistic factors in
prompts, specifically readability, formality, and concreteness, influence the
occurrence of hallucinations. Our experimental results suggest that prompts
characterized by greater formality and concreteness tend to result in reduced
hallucination. However, the outcomes pertaining to readability are somewhat
inconclusive, showing a mixed pattern.
FACTIFY-5WQA: 5W Aspect-based Fact Verification through Question Answering
Automatic fact verification has received significant attention recently.
Contemporary automatic fact-checking systems focus on estimating truthfulness
using numerical scores which are not human-interpretable. A human fact-checker
generally follows several logical steps to verify a verisimilitude claim and
conclude whether it is truthful or a mere masquerade. Popular fact-checking
websites follow a common structure for fact categorization such as half true,
half false, false, pants on fire, etc. Therefore, it is necessary to have an
aspect-based (delineating which part(s) are true and which are false)
explainable system that can assist human fact-checkers in asking relevant
questions related to a fact, which can then be validated separately to reach a
final verdict. In this paper, we propose a 5W framework (who, what, when,
where, and why) for question-answer-based fact explainability. To that end, we
present a semi-automatically generated dataset called FACTIFY-5WQA, which
consists of 391,041 facts along with relevant 5W QAs - underscoring our major
contribution to this paper. A semantic role labeling system has been utilized
to locate 5Ws, which generates QA pairs for claims using a masked language
model. Finally, we report a baseline QA system to automatically locate those
answers from evidence documents, which can serve as a baseline for future
research in the field. Lastly, we propose a robust fact verification system
that takes paraphrased claims and automatically validates them. The dataset and
the baseline model are available at https://github.com/ankuranii/acl-5W-QA
Comment: Accepted at ACL main conference 202
The Troubling Emergence of Hallucination in Large Language Models -- An Extensive Definition, Quantification, and Prescriptive Remediations
The recent advancements in Large Language Models (LLMs) have garnered
widespread acclaim for their remarkable emerging capabilities. However, the
issue of hallucination has emerged in parallel as a by-product, posing
significant concerns. While some recent endeavors have been made to identify
and mitigate different types of hallucination, there has been a limited
emphasis on the nuanced categorization of hallucination and associated
mitigation methods. To address this gap, we offer a fine-grained discourse on
profiling hallucination based on its degree, orientation, and category, along
with offering strategies for alleviation. As such, we define two overarching
orientations of hallucination: (i) factual mirage (FM) and (ii) silver lining
(SL). To provide a more comprehensive understanding, both orientations are
further sub-categorized into intrinsic and extrinsic, with three degrees of
severity - (i) mild, (ii) moderate, and (iii) alarming. We also meticulously
categorize hallucination into six types: (i) acronym ambiguity, (ii) numeric
nuisance, (iii) generated golem, (iv) virtual voice, (v) geographic erratum,
and (vi) time wrap. Furthermore, we curate HallucInation eLiciTation (HILT), a
publicly available dataset comprising 75,000 samples generated using 15
contemporary LLMs along with human annotations for the aforementioned
categories. Finally, to establish a method for quantifying hallucination and to offer a
comparative spectrum that allows us to evaluate and rank LLMs based on their
vulnerability to producing hallucinations, we propose Hallucination
Vulnerability Index (HVI). We firmly believe that HVI holds significant value
as a tool for the wider NLP community, with the potential to serve as a rubric
in AI-related policy-making. In conclusion, we propose two solution strategies
for mitigating hallucinations.
Counter Turing Test CT^2: AI-Generated Text Detection is Not as Easy as You May Think -- Introducing AI Detectability Index
With the rise of prolific ChatGPT, the risk and consequences of AI-generated
text have increased alarmingly. To address the inevitable question of ownership
attribution for AI-generated artifacts, the US Copyright Office released a
statement stating that 'If a work's traditional elements of authorship were
produced by a machine, the work lacks human authorship and the Office will not
register it'. Furthermore, both the US and the EU governments have recently
drafted their initial proposals regarding the regulatory framework for AI.
Given this cynosural spotlight on generative AI, AI-generated text detection
(AGTD) has emerged as a topic that has already received immediate attention in
research, with some initial methods having been proposed, soon followed by
the emergence of techniques to bypass detection. This paper introduces the Counter
Turing Test (CT^2), a benchmark consisting of techniques aiming to offer a
comprehensive evaluation of the robustness of existing AGTD techniques. Our
empirical findings unequivocally highlight the fragility of the proposed AGTD
methods under scrutiny. Amidst the extensive deliberations on policy-making for
regulating AI development, it is of utmost importance to assess the
detectability of content generated by LLMs. Thus, to establish a quantifiable
spectrum facilitating the evaluation and ranking of LLMs according to their
detectability levels, we propose the AI Detectability Index (ADI). We conduct a
thorough examination of 15 contemporary LLMs, empirically demonstrating that
larger LLMs tend to have a higher ADI, indicating they are less detectable
compared to smaller LLMs. We firmly believe that ADI holds significant value as
a tool for the wider NLP community, with the potential to serve as a rubric in
AI-related policy-making.
Comment: EMNLP 2023 Mai